20 research outputs found
Database integrated analytics using R: initial experiences with SQL-Server + R
© 2016 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
Most data scientists nowadays use functional or semi-functional languages such as SQL, Scala or R to process data obtained directly from databases. This requires fetching the data, processing it, and storing it again, and tends to be done outside the DB, in often complex data-flows. Recently, database service providers have begun to integrate "R-as-a-Service" into their DB solutions: the analytics engine is called directly from the SQL query tree, and results are returned as part of the same query. Here we give a first taste of this technology by testing the portability of our ALOJA-ML analytics framework, written in R, to Microsoft SQL-Server 2016, one of the recently released SQL+R solutions. In this work we discuss data-flow schemes for porting a local DB + analytics engine architecture towards Big Data, focusing especially on the new DB-integrated analytics approach, and comment on our first experiences with the usability and performance of these new services and capabilities.
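The traditional "outside the DB" data-flow that the abstract contrasts with in-database analytics can be sketched in a few lines. The table, column names, and the trivial averaging "model" below are illustrative assumptions, not part of ALOJA-ML:

```python
import sqlite3

# Sketch of the classical fetch-process-store loop the abstract describes.
# Table and column names are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE runs (config TEXT, exec_time REAL)")
conn.executemany("INSERT INTO runs VALUES (?, ?)",
                 [("a", 10.0), ("a", 12.0), ("b", 20.0)])

# 1. Fetch data out of the database.
rows = conn.execute("SELECT config, exec_time FROM runs").fetchall()

# 2. Process it in the external analytics engine (here, plain Python
#    standing in for R): a naive mean-per-configuration "predictor".
by_config = {}
for config, t in rows:
    by_config.setdefault(config, []).append(t)
predictions = {c: sum(ts) / len(ts) for c, ts in by_config.items()}

# 3. Store the results back into the database.
conn.execute("CREATE TABLE predictions (config TEXT, predicted_time REAL)")
conn.executemany("INSERT INTO predictions VALUES (?, ?)", predictions.items())

print(predictions)
```

In the DB-integrated approach, step 2 would instead run inside the database engine, invoked from the SQL query tree, avoiding the round-trip of the raw data through an external process.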
When and How to Apply Statistics, Machine Learning and Deep Learning Techniques
Machine learning has become a 'commodity' in engineering and the experimental sciences, as calculus and statistics did before it. After the hype of the 2000s, machine learning (statistical learning, neural networks, etc.) has become a solid and reliable set of techniques available to the general researcher population for inclusion in their common procedures, far from the mysticism that surrounded the field when only ML experts could solve modeling and prediction problems with such novel algorithms. But while knowledge of the field has settled among professionals, novice ML users still have trouble deciding when particular techniques can and should be applied to a given problem, sometimes ending up with over-complicated solutions to simple problems, or with complex problems only partially solved by simplistic methods. This tutorial introduces the most common techniques in statistical learning and neural networks, showing which techniques suit each given scenario.
This project has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement No 639595).
Modeling cloud resources using machine learning
Cloud computing is a new Internet infrastructure paradigm in which management optimization has become a challenge to be solved, as all current management systems are human-driven or ad-hoc automatic systems that must be tuned manually by experts. Management of cloud resources requires accurate information about all the elements involved (host machines, resources, offered services, and clients), and some of this information can only be obtained a posteriori. Here we present the cloud, and part of its architecture, as a new scenario where data mining and machine learning can be applied to discover information and improve its management through modeling and prediction. As a novel case study, we show the modeling of basic cloud resources using machine learning: predicting resource requirements from context information such as the amount of load and number of clients, and predicting quality of service from resource planning, in order to feed cloud schedulers. Further, this work is an important part of our ongoing research program, in which accurate models and predictors are essential to optimize cloud management autonomic systems.
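As an illustration of the kind of predictor described (not the paper's actual models), a one-feature least-squares fit mapping observed load to consumed CPU could look like this; all numbers and the linear-model choice are invented assumptions:

```python
# Minimal sketch: predict a resource requirement (CPU cores) from context
# information (load in requests/s) with ordinary least squares.

def fit_linear(xs, ys):
    """Closed-form OLS for y = a*x + b with a single feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    a = cov / var
    b = mean_y - a * mean_x
    return a, b

# Observed load vs CPU actually consumed (invented training data).
load = [10, 20, 30, 40]
cpu = [1.0, 2.0, 3.0, 4.0]
a, b = fit_linear(load, cpu)

def predict_cpu(requests_per_s):
    """Predicted CPU requirement for an unseen load level."""
    return a * requests_per_s + b

print(predict_cpu(25))
```

A scheduler could consume such predictions to provision hosts before the load arrives, which is the "feed cloud schedulers" role the abstract mentions.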
A resilient and distributed near real-time traffic forecasting application for Fog computing environments
In this paper we propose an architecture for a city-wide traffic modeling and prediction service based on the Fog Computing paradigm. The work assumes a scenario in which a number of distributed antennas receive data generated by vehicles across the city. In the Fog, data is collected and processed in local and intermediate nodes, and finally forwarded to a central Cloud location for further analysis. We propose a combination of a data distribution algorithm, resilient to back-haul connectivity issues, and a traffic modeling approach based on deep learning techniques to provide distributed traffic forecasting capabilities. In our experiments, we leverage real traffic logs from one week of Floating Car Data (FCD) generated in the city of Barcelona by a road-assistance service fleet comprising thousands of vehicles. FCD was processed across several simulated conditions, ranging from scenarios in which no connectivity failures occurred in the Fog nodes to situations with long and frequent connectivity outage periods. For each scenario, the resilience and accuracy of both the data distribution algorithm and the learning methods were analyzed. Results show that the data distribution process running in the Fog nodes is resilient to back-haul connectivity issues and is able to deliver data to the Cloud location even in the presence of severe connectivity problems. Additionally, the proposed traffic modeling and forecasting method exhibits better behavior when run distributed in the Fog instead of centralized in the Cloud, especially when connectivity issues force data to be delivered out of order to the Cloud.
This project is partially supported by the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement No 639595). It is also partially supported by the Ministry of Economy of Spain under contract TIN2015-65316-P and the Generalitat de Catalunya under contract 2014SGR1051, by the ICREA Academia program, and by the BSC-CNS Severo Ochoa program (SEV-2015-0493). The authors gratefully acknowledge the Reial Automòbil Club de Catalunya (RACC) for the dataset of Floating Car Data provided.
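A minimal store-and-forward loop with the resilience property described (buffer locally during back-haul outages, deliver when connectivity returns) can be sketched as follows; this is an assumed simplification, not the paper's actual distribution algorithm:

```python
from collections import deque

# Sketch of a fog-node buffering loop: records are kept locally and
# flushed to the Cloud whenever the back-haul link is up, so connectivity
# outages delay delivery but lose no data.
class FogNode:
    def __init__(self):
        self.buffer = deque()
        self.cloud = []  # stands in for the central Cloud location

    def ingest(self, record):
        """Receive one vehicle record from a local antenna."""
        self.buffer.append(record)

    def flush(self, connected):
        """Forward buffered records if the back-haul link is available."""
        if not connected:
            return 0
        sent = 0
        while self.buffer:
            self.cloud.append(self.buffer.popleft())
            sent += 1
        return sent

node = FogNode()
for t, pos in enumerate(["p0", "p1", "p2"]):
    node.ingest((t, pos))
node.flush(connected=False)  # outage: nothing sent, nothing lost
node.flush(connected=True)   # link restored: the backlog is delivered
print(node.cloud)
```

Note that records buffered through an outage arrive at the Cloud later than real time, which is exactly the out-of-order delivery scenario the experiments evaluate.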
Improving Maritime Traffic Emission Estimations on Missing Data with CRBMs
Maritime traffic emissions are a major concern to governments as they heavily
impact the Air Quality in coastal cities. Ships use the Automatic
Identification System (AIS) to continuously report position and speed among
other features, and therefore this data is suitable to be used to estimate
emissions, if it is combined with engine data. However, important ship features
are often inaccurate or missing. State-of-the-art complex systems, like CALIOPE
at the Barcelona Supercomputing Center, are used to model Air Quality. These
systems can benefit from AIS based emission models as they are very precise in
positioning the pollution. Unfortunately, these models are sensitive to missing
or corrupted data, and therefore they need data curation techniques to
significantly improve the estimation accuracy. In this work, we propose a
methodology for treating ship data using Conditional Restricted Boltzmann
Machines (CRBMs) plus machine learning methods to improve the quality of data
passed to emission models. Results show that we can improve on the default methods
proposed to cover missing data; using our method, the models boosted their
accuracy, detecting otherwise undetectable emissions. In particular, we used a
real data-set of AIS data, provided by the Spanish Port Authority, to estimate
that, thanks to our method, the model was able to detect 45% additional
emissions, representing 152 tonnes of pollutants per week in Barcelona. We also
propose new features that may enhance emission modeling.
Comment: 12 pages, 7 figures. Postprint accepted manuscript; find the full
version at Engineering Applications of Artificial Intelligence
(https://doi.org/10.1016/j.engappai.2020.103793).
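As a rough illustration of the data-curation step: the paper uses CRBMs plus machine learning methods, but a much simpler similar-ship mean can serve as a stand-in baseline here. All field names and numbers are invented for illustration:

```python
# Toy ship records; engine power is the feature that is often missing in
# AIS-derived data and must be curated before emission estimation.
ships = [
    {"length": 100, "engine_kw": 5000},
    {"length": 110, "engine_kw": 5600},
    {"length": 300, "engine_kw": 30000},
    {"length": 105, "engine_kw": None},  # missing feature to be imputed
]

def impute_engine_kw(ship, fleet, tol=20):
    """Fill a missing engine power with the mean over similar-length ships."""
    if ship["engine_kw"] is not None:
        return ship["engine_kw"]
    similar = [s["engine_kw"] for s in fleet
               if s["engine_kw"] is not None
               and abs(s["length"] - ship["length"]) <= tol]
    return sum(similar) / len(similar)

print(impute_engine_kw(ships[3], ships))
```

A model-based imputer such as a CRBM plays the same role as `impute_engine_kw` in the pipeline, but it can condition on many features at once instead of a single length window.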
Challenges and Opportunities for RISC-V Architectures towards Genomics-based Workloads
The use of large-scale supercomputing architectures is a hard requirement for
scientific computing Big-Data applications. An example is genomics analytics,
where millions of data transformations and tests per patient need to be done to
find relevant clinical indicators. Therefore, to ensure open and broad access
to high-performance technologies, governments and academia are pushing toward
the introduction of novel computing architectures in large-scale scientific
environments. This is the case of RISC-V, an open-source and royalty-free
instruction-set architecture. To evaluate such technologies, here we present
the Variant-Interaction Analytics use case benchmarking suite and datasets.
Through this use case, we search for possible genetic interactions using
computational and statistical methods, providing a representative case for
heavy ETL (Extract, Transform, Load) data processing. Current implementations
run on x86-based supercomputers (e.g. MareNostrum-IV at the Barcelona
Supercomputing Center (BSC)), and future steps propose RISC-V as part of the
next MareNostrum generations. Here we describe the Variant Interaction Use
Case, highlighting the characteristics that leverage high-performance
computing, and indicating the caveats and challenges for the RISC-V
developments and designs to come, based on a first comparison between x86 and
RISC-V architectures on real Variant Interaction executions over real hardware
implementations.
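The statistical core of such a variant-interaction scan can be sketched as a chi-square test over a genotype-pair vs. phenotype contingency table; the counts and the 2x2 layout below are invented for illustration and are not taken from the actual benchmarking suite:

```python
# Pearson chi-square statistic for a 2D contingency table, the kind of
# test run millions of times per patient cohort in an interaction scan.
def chi_square(table):
    """Sum of (observed - expected)^2 / expected over all cells."""
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    total = sum(rows)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = rows[i] * cols[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat

# Invented counts: cases vs controls for carriers of a variant pair.
table = [[30, 10],   # carriers of both variants
         [20, 40]]   # everyone else
print(round(chi_square(table), 3))
```

The ETL-heavy character of the workload comes from repeating such a test across every variant pair, which is why memory bandwidth and vector throughput matter in the x86 vs. RISC-V comparison.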
The holistic perspective of the INCISIVE project: artificial intelligence in screening mammography
Finding new ways to cost-effectively facilitate population screening and improve cancer diagnosis at an early stage, supported by data-driven AI models, provides unprecedented opportunities to reduce cancer-related mortality. This work presents the INCISIVE project initiative towards enhancing AI solutions for health imaging by unifying, harmonizing, and securely sharing scattered cancer-related data, to ensure the large datasets that are critically needed to develop and evaluate trustworthy AI models. The adopted solutions of the INCISIVE project are outlined in terms of data collection, harmonization, data sharing, and federated data storage, in compliance with legal, ethical, and FAIR principles. Experiences and examples feature breast cancer data integration and mammography collection, indicating the current progress, challenges, and future directions.